Abstract. This paper describes an efficient reduction of the learning
problem of ranking to binary classification. The reduction guarantees
an average pairwise misranking regret of at most that of the binary
classifier regret, improving a recent result of Balcan et al., which only
guarantees a factor of 2. Moreover, our reduction applies to a broader
class of ranking loss functions, admits a simpler proof, and the expected
running time complexity of our algorithm in terms of number of calls to
a classifier or preference function is improved from $\Omega(n^2)$ to $O(n \log n)$.
In addition, when only the top $k$ ranked elements are required ($k \ll n$),
as in many applications in information extraction or search engines, the
time complexity of our algorithm can be further reduced to $O(k \log k + n)$.
Our reduction and algorithm are thus practical for realistic applications
where the number of points to rank exceeds several thousand. Many of
our results also extend beyond the bipartite case previously studied.
Our reduction is a randomized one. To complement our result, we also
derive lower bounds on any deterministic reduction from binary (preference)
classification to ranking, implying that our use of a randomized
reduction is essentially necessary for the guarantees we provide.



1 Introduction 

The learning problem of ranking arises in many modern applications, including 
the design of search engines, information extraction, and movie recommendation 
systems. In these applications, the ordering of the documents or movies returned 
is a critical aspect of the system. 

The problem has been formulated within two distinct settings. In the
score-based setting, the learning algorithm receives a labeled sample of pairwise
preferences and returns a scoring function $f : U \to \mathbb{R}$ which induces a linear
ordering of the points in the set $U$. Test points are simply ranked according to
the values of $f$ for those points. Several ranking algorithms, including RankBoost
[13, 21], SVM-type ranking [17], and other algorithms such as PRank [12, 2], were
designed for this setting. Generalization bounds have been given in this setting
for the pairwise misranking error [13, 1], including margin-based bounds [21].
Stability-based generalization bounds have also been given in this setting for
wide classes of ranking algorithms both in the case of bipartite ranking [2] and
the general case [11, 10].

A somewhat different two-stage scenario was considered in other publications
starting with Cohen, Schapire, and Singer [8], and later Balcan et al. [6], which
we will refer to as the preference-based setting. In the first stage of that setting,
a preference function $h : U \times U \to [0,1]$ is learned, where values of $h(u,v)$ closer
to one indicate that $u$ is ranked above $v$ and values closer to zero the opposite.
$h$ is typically assumed to be the output of a classification algorithm trained
on a sample of labeled pairs, and can be for example a convex combination of
simpler preference functions as in [8]. A crucial difference with the score-based
setting is that, in general, the preference function $h$ does not induce a linear
ordering. The order it induces may be non-transitive; thus we may have, for
example, $h(u,v) = h(v,w) = h(w,u) = 1$ for three distinct points $u$, $v$, and $w$.
To rank a test subset $V \subseteq U$, in the second stage, the algorithm orders the points
in $V$ by making use of the preference function $h$ learned in the first stage.

This paper deals with the preference-based ranking setting just described.
The advantage of this setting is that the learning algorithm is not required to
return a linear ordering of all points in $U$, which is impossible to achieve faultlessly
in accordance with a true pairwise preference labeling that is non-transitive. This
is more likely to be achievable exactly or with a better approximation when the
algorithm is requested instead, as in this setting, to supply a linear ordering
only for a limited subset $V \subseteq U$.

When the preference function is learned by a binary classification algorithm, 
the preference-based setting can be viewed as a reduction of the ranking problem 
to a classification one. The second stage specifies how the ranking is obtained 
using the preference function. 

Cohen, Schapire, and Singer [8] showed that in the second stage of the
preference-based setting, the general problem of finding a linear ordering with
as few pairwise misrankings as possible with respect to the preference function
$h$ is NP-complete. The authors presented a greedy algorithm based on the
tournament degree for each element $u \in V$, defined as the difference between the
number of elements $u$ is preferred to and the number of those preferred to
$u$. The bound proven by these authors, formulated in terms of the pairwise
disagreement loss $l$ with respect to the preference function $h$, can be written as
$l(\sigma_{greedy}, h) \le 1/2 + l(\sigma_{optimal}, h)/2$, where $l(\sigma_{greedy}, h)$ is the loss achieved by
the permutation $\sigma_{greedy}$ returned by their algorithm and $l(\sigma_{optimal}, h)$ the one
achieved by the optimal permutation $\sigma_{optimal}$ with respect to the preference
function $h$. This bound was given for the general case of ranking, but in the
particular case of bipartite ranking (which we define below), a random ordering
can achieve a pairwise disagreement loss of 1/2 and thus the bound is not
informative.

More recently, Balcan et al. [6] studied the bipartite ranking problem and
showed that sorting the elements of $V$ according to the same tournament degree
used by [8] guarantees a pairwise misranking regret of at most $2r$ using a binary
classifier with regret $r$. However, due to the quadratic nature of the definition
of the tournament degree, their algorithm requires $\Omega(n^2)$ calls to the preference
function $h$, where $n = |V|$ is the number of objects to rank.

We describe an efficient algorithm for the second stage of the preference-based
setting and thus for reducing the learning problem of ranking to binary
classification. We improve on the recent result of Balcan et al. [6] by guaranteeing
an average pairwise misranking regret of at most $r$ using a binary classifier with
regret $r$. In other words, we improve their constant from 2 to 1. Our reduction
applies (with different constants) to a broader class of ranking loss functions,
admits a simpler proof, and the expected running time complexity of our
algorithm in terms of number of calls to a classifier or preference function is improved
from $\Omega(n^2)$ to $O(n \log n)$. Furthermore, when only the top $k$ ranked elements
are required ($k \ll n$), as in many applications in information extraction or
search engines, the time complexity of our algorithm can be further reduced to
$O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic
applications where the number of points to rank exceeds several thousand. Many
of our results also extend beyond the bipartite case previously studied by [6] to
the general case of ranking. A by-product of our proofs is also a bound on the
pairwise disagreement loss with respect to the preference function $h$ that we will
compare to the result given by Cohen, Schapire, and Singer [8].

The algorithm used by Balcan et al. [7] to produce a ranking based on the 
preference function is known as sort-by-degree and has been recently used in the 
context of minimizing the feedback arcset in tournaments [9]. Here, we use a 
different algorithm, QuickSort, which has also been recently used for minimizing 
the feedback arcset in tournaments [4,3]. The techniques presented make use 
of the earlier work by Ailon et al. on combinatorial optimization problems over 
rankings and clustering [4,3]. 

The remainder of the paper is structured as follows. In Section 2, we introduce 
the definitions and notation used in future sections and introduce a family of 
general loss functions that can be used to measure the quality of a ranking 
hypothesis. Section 3 describes a simple and efficient algorithm for reducing 
ranking to binary classification, proves several bounds guaranteeing the quality of 
the ranking produced by the algorithm, and shows the running-time complexity 
of our algorithm to be very efficient. In Section 4 we discuss the relationship of the 
algorithm and its proof with previous related work in combinatorial optimization. 
In Section ?? we derive a lower bound of factor 2 on any deterministic reduction 
from binary (preference) classification to ranking, implying that our use of a 
randomized reduction is essentially necessary for the improved guarantees we 
provide. 

2 Preliminaries 

This section introduces several preliminary definitions necessary for the
presentation of our results. In what follows, $U$ will denote a universe of elements (e.g.
the collection of all possible query-result pairs returned by a web search task)
and $V \subseteq U$ will denote a small subset thereof (e.g. a preliminary list of relevant
results for a given query). For simplicity of notation we will assume that $U$ is
a set of integers, so that we are always able to choose a minimal (canonical)
element in a finite subset (as we do in (9) below). This arbitrary ordering should
not be confused with the ranking problem we are considering.

2.1 General Definitions and Notation 

We first briefly discuss the learning setting and assumptions made by Balcan
et al. [7] and Cohen et al. [8] and introduce a consistent notation to make it
easier to compare our results with those of this previous work.

Ground truth In [7], the ground truth is a bipartite ranking of the set $V$
of elements that one wishes to rank.^ A bipartite ranking is a partition of $V$
into positive and negative elements where positive elements are preferred over
negative ones and elements sharing the same label are in a tie. This is a natural
setting when the human raters assign a positive or negative label to each element.
Here, we will allow a more general structure where the ground truth is a ranking
$\sigma^*$ equipped with a weight function $\omega$, which can be used for encoding ties. The
bipartite case can be encoded by choosing a specific $\omega$ as we shall further discuss
below.

In [8], the "ground truth" has a different interpretation, which we briefly 
discuss in Section 3.4. 

Preference function In both [8] and [7], a preference function $h : U \times U \to [0, 1]$
is assumed, which is learned in a first learning stage. The convention is that the
higher $h(u,v)$ is, the stronger our belief that $u$ should be ahead of $v$. The function
$h$ satisfies pairwise consistency: $h(u,v) + h(v,u) = 1$, but need not even be
transitive on 3-tuples. The second stage uses $h$ to output a proper ranking $\sigma$, as
we shall further discuss below. The running time complexity of the second stage
is measured with respect to the number of calls to $h$.

Output of learning algorithm The final output of the second stage of the
algorithm, $\sigma$, is a proper ranking of $V$. Its cost is measured differently in [7]
and [8]. In [7], it is measured against $\sigma^*$ and compared to the cost of $h$ against
$\sigma^*$. This can be thought of as the best one could hope for if $h$ encodes all the
available information. In [8], $\sigma$ is measured against the given preference function
$h$, and compared to the best one can get.

^ More generally, the ground truth may also be a distribution of bipartite rankings, 
but the error bounds both in our work and that of previous work are achieved by 
fixing one ground truth and taking conditional expectation as a final step. Thus, we 
can assume that it is fixed. 



2.2 Loss Functions 

We are now ready to define the loss functions used to measure the quality of an
output ranking $\sigma$ either with respect to $\sigma^*$ (as in [7]) or with respect to $h$ (as in
[8]).

Let $V \subseteq U$ be a finite subset that we wish to rank and let $S(V)$ denote the set
of rankings on $V$, that is the set of injections from $V$ to $[n] = \{1, \ldots, n\}$, where
$n = |V|$. If $\sigma \in S(V)$ is such a ranking, then $\sigma(u)$ is the rank of an element
$u \in V$, where "lower" is interpreted as "ahead". More precisely, we say that
$u$ is preferred over $v$ with respect to $\sigma$ if $\sigma(u) < \sigma(v)$. For compatibility with
the notation used for general preference functions, we also write $\sigma(u,v) = 1$ if
$\sigma(u) < \sigma(v)$ and $\sigma(u,v) = 0$ otherwise.

The following general loss function $L_\omega$ measures the quality of a ranking $\sigma$
with respect to a desired one $\sigma^*$ using a weight function $\omega$ (described below):

$$L_\omega(\sigma, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \ne v} \sigma(u,v)\, \sigma^*(v,u)\, \omega(\sigma^*(v), \sigma^*(u)) . \tag{1}$$

The sum is over all pairs $u, v$ in the domain $V$ of the rankings $\sigma, \sigma^*$. It counts
the number of inverted pairs $u, v \in V$ weighted by $\omega$, which assigns importance
coefficients to pairs, based on their positions in $\sigma^*$. The function $\omega$ must satisfy
the following three natural axioms, which will be necessary in our analysis:

(P1) Symmetry: $\omega(i,j) = \omega(j,i)$ for all $i, j$;
(P2) Monotonicity: $\omega(i,j) \le \omega(i,k)$ if either $i < j < k$ or $i > j > k$;
(P3) Triangle inequality: $\omega(i,j) \le \omega(i,k) + \omega(k,j)$.

This definition is very general and encompasses many useful, well studied
distance functions. Setting $\omega(i,j) = 1$ for all $i \ne j$ yields the unweighted pairwise
misranking measure or the so-called Kemeny distance function.
For a fixed integer $k$, the following function

$$\omega(i,j) = \begin{cases} 1 & \text{if } ((i \le k) \vee (j \le k)) \wedge (i \ne j) \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

can be used to emphasize ranking at the top $k$ elements. Misranking of pairs with
one element ranked among the top $k$ is penalized by this function. This can be of
interest in applications such as information extraction or search engines where
the ranking of the top documents matters more. For this emphasis function, all
elements ranked below $k$ are in a tie. In fact, it is possible to encode any tie
relation using $\omega$.

The loss function considered in [6] can also be straightforwardly encoded
with the following bipartite emphasis function

$$\omega(i,j) = \begin{cases} 1 & \text{if } (i \le k) \wedge (j > k) \\ 1 & \text{if } (j \le k) \wedge (i > k) \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Items in positions $1, \ldots, k$ for the permutation $\sigma$ can be thought of as the positive
items (in a tie), and those in $k + 1, \ldots, |V|$ as negative (also in a tie). This choice
coincides with (1 - AUC), where AUC is the area under the ROC curve commonly
used in statistics and machine learning problems [14, 19].
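
As an illustration of the loss family just described, the following Python sketch computes a normalized, weighted count of inverted pairs for the three weight choices discussed (Kemeny, top-$k$ emphasis, bipartite). It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, penalized pairs are assumed to carry weight 1, and the sum is assumed to be normalized by the number of unordered pairs $n(n-1)/2$, as in the decomposition used in Section 3.4.

```python
def kemeny_weight(i, j):
    """Unweighted case: every pair of distinct positions has weight 1."""
    return 1 if i != j else 0


def top_k_weight(k):
    """Assumed form: weight 1 when at least one of two distinct positions is in the top k."""
    return lambda i, j: 1 if i != j and (i <= k or j <= k) else 0


def bipartite_weight(k):
    """Weight 1 only when positions i and j fall on opposite sides of k."""
    return lambda i, j: 1 if (i <= k < j) or (j <= k < i) else 0


def ranking_loss(sigma, sigma_star, weight):
    """Normalized weighted count of pairs that sigma orders against sigma_star.

    sigma and sigma_star map each element to its rank (1 = most preferred).
    """
    elems = list(sigma_star)
    n = len(elems)
    total = 0
    for u in elems:
        for v in elems:
            # sigma puts u ahead of v, while the ground truth puts v ahead of u
            if u != v and sigma[u] < sigma[v] and sigma_star[v] < sigma_star[u]:
                total += weight(sigma_star[v], sigma_star[u])
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    truth = {"a": 1, "b": 2, "c": 3, "d": 4}
    reverse = {"a": 4, "b": 3, "c": 2, "d": 1}
    print(ranking_loss(reverse, truth, kemeny_weight))
    print(ranking_loss(reverse, truth, top_k_weight(1)))
    print(ranking_loss(reverse, truth, bipartite_weight(2)))
```

Reversing the ground truth inverts every pair, so the Kemeny variant reaches its maximum there, while the top-$k$ and bipartite variants only charge the pairs their weight functions select.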

Clearly, setting $\omega(i,j) = |s(i) - s(j)|$ for any monotone score function $s$ works
as well. It is well known though that such a function can in fact be expressed as
a convex combination of functions of the type (3). Hence, a bipartite function
should be thought of as the simplest

In general, one may wish to work with a collection of ground truths $\sigma^*_1, \ldots, \sigma^*_N$
and weight functions $\omega_1, \ldots, \omega_N$ and a loss function which is a sum over the
individual losses with respect to $\sigma^*_i, \omega_i$, e.g. in meta searching^. Since our bound
is based on the expected loss, it will straightforwardly generalize to this setting
using the linearity of expectation. Thus, we can assume without loss of generality
a single ground truth $\sigma^*$ equipped with a single $\omega$.

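Preference Loss Function We need to extend the definition to measure the
loss of a preference function $h$ with respect to $\sigma^*$. Recall that a higher value of
$h(u,v)$ expresses a stronger belief that $u$ should be ahead of $v$. In
contrast with the loss function just defined, we need to define a preference loss
measuring a ranking's disagreements with respect to a preference function $h$.
When measured against $\sigma^*, \omega$, the function $L_\omega$ can be readily used:

$$L_\omega(h, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \ne v} h(u,v)\, \sigma^*(v,u)\, \omega(\sigma^*(v), \sigma^*(u)) . \tag{4}$$

We use $L$ to denote $L_\omega$ for the special case where $\omega$ is the constant function 1,
$\omega = 1$:

$$L(h, \sigma) = \binom{n}{2}^{-1} \sum_{u \ne v} h(u,v)\, \sigma(v,u) . \tag{5}$$

The special case of $L$ coincides with the standard pairwise disagreement loss of
a ranking with respect to a preference function as used in [8].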

2.3 The Special Bipartite Case 

A particular case of interest is when $\omega$ belongs to the family of weight functions
defined in (3). For this particular case we will use a slightly more convenient
notation. For a set of elements $V \subseteq U$, let $\Pi(V)$ denote the set of partitions of $V$
into two sets (positive and negative). More formally, $\tau \in \Pi(V)$ is a function from
$V$ to $\{0, 1\}$ (where 0 should be thought of as the preferred or positive value,
and 1 the negative; we choose this convention so that $\tau(u)$ can be interpreted
as the rank of $u$, where there are two possible ranks). Abusing notation we say
that $\tau(u,v) = 1$ if $\tau(u) < \tau(v)$ ($u$ is preferred over $v$) and $\tau(u,v) = 0$ otherwise
(note that here we can have $\tau(u,v) = \tau(v,u) = 0$). Our abuse of notation allows
us to use the readily defined function $L$ to measure the loss of a ranking $\sigma \in S(V)$
against $\tau^* \in \Pi(V)$ (which will usually take the role of a ground truth):

$$L(\sigma, \tau^*) = \binom{n}{2}^{-1} \sum_{u \ne v} \sigma(u,v)\, \tau^*(v,u) .$$

Note that this coincides with $L_\omega(\sigma, \sigma^*)$, where $\sigma^*$ is any ranking on $V$ with
$\sigma^*(u) < \sigma^*(v)$ whenever $\tau^*(u) < \tau^*(v)$, and $\omega$ is as in (3) with
$k = |\{u \in V : \tau^*(u) = 0\}|$.

^ The reason we have separate weight functions $\omega_i$ is that, e.g., each search engine
may output top-$k$ outputs for different values of $k$.

A note on normalization: The bipartite case is the one considered in [6], with
a small difference which is crucial for some of the bounds we derive. There, the
loss function is defined as

$$|\{u, v : \tau^*(u) < \tau^*(v)\}|^{-1} \sum_{u \ne v} \sigma(u,v)\, \tau^*(v,u) . \tag{6}$$

If we are working with just one $\tau^*$, the two loss functions are the same up to a
constant. However, if we have a distribution over $\tau^*$ and consider the expected
loss, then there may be a difference. For simplicity we will work with the
definition derived from (4), and will leave the other choice for discussion in Section 4.

2.4 Independence on Irrelevant Alternatives and Regret Functions

The subset $V$ is chosen from the universe $U$ from some distribution. Together
with $V$, a ground truth ranking $\sigma^* \in S(V)$ and an admissible weight function $\omega$
are also chosen randomly. We let $D$ denote the distribution on $V, \sigma^*, \omega$. (In the
bipartite case, $D$ is a distribution on $V$ and on $\tau^* \in \Pi(V)$.)

Definition 1. A distribution $D$ on $V, \sigma^*, \omega$ satisfies the pairwise independence
on irrelevant alternatives (IIA) property if for all distinct $u, v \in U$, conditioned
on $u, v \in V$ the random variables $\sigma^*(u,v)\,\omega(u,v)$ and $V \setminus \{u, v\}$ are independent.

In the bipartite case this translates to

Definition 2. A distribution $D$ on $V, \tau^*$ satisfies the pairwise IIA property if
for all distinct $u, v \in U$, conditioned on $u, v \in V$ the random variables $\tau^*(u,v)$
and $V \setminus \{u, v\}$ are independent.

Note that in the bipartite case, $D$ can satisfy pairwise IIA while not satisfying
pointwise IIA. (Pointwise IIA means that conditioned on $u \in V$, $\tau^*(u)$ and
$V \setminus \{u\}$ are independent.) In certain applications, e.g. when ground truth is
obtained from humans, it is reasonable not to assume pointwise IIA. Think of the
"grass is greener" phenomenon: a satisfactory option may seem unsatisfactory
in the presence of an alternative. Continuing the analogy, assuming pairwise
IIA means that choosing between two options does not depend on the presence
of a third alternative. (By choosing we mean that ties are allowed.)



In this work we do not assume pointwise IIA, and when deriving loss bounds
we will not assume pairwise IIA either. We will need pairwise IIA when working
with regret, which is an adjustment of the loss designed so that an optimal
solution would have a value of 0 with respect to the ground truth. As pointed
out in [6], the regret measures the loss modulo "noise".

Using regret (here) makes sense when the optimal solution has a strictly
positive loss value. In our case it can only happen if the ground truth is a proper
distribution, namely, the probability mass is not concentrated on one point.

To define ranking regret, assume we are learning how to obtain a full ranking
$\sigma$ of $V$, using an algorithm $A$, so that $\sigma = A_s(V)$, where $s$ is a random stream of
bits possibly used by the algorithm. For ranking learning, we define the regret
of $A$ against $D$ as

$$R_{rank}(A, D) = E_{V,\sigma^*,\omega,s}[L_\omega(A_s(V), \sigma^*)] - \min_{\tilde\sigma \in S(U)} E_{V,\sigma^*,\omega}[L_\omega(\tilde\sigma|_V, \sigma^*)] ,$$

where $\tilde\sigma|_V \in S(V)$ is defined by restricting the ranking $\tilde\sigma \in S(U)$ to $V$ in a
natural way.

In the preference classification setting, it makes sense to define the regret of
a preference function $h : U \times U \to \{0, 1\}$ as follows:

$$R_{class}(h, D) = E_{V,\sigma^*,\omega}[L_\omega(h|_V, \sigma^*)] - \min_{\tilde h} E_{V,\sigma^*,\omega}[L_\omega(\tilde h|_V, \sigma^*)] ,$$

where the minimum is over $\tilde h$, a preference function over $U$, and $\cdot|_V$ is a restriction
operator on preference functions defined in the natural way. For the bipartite
special case, we have the simplified form:

$$R_{rank}(A, D) = E_{V,\tau^*,s}[L(A_s(V), \tau^*)] - \min_{\tilde\sigma \in S(U)} E_{V,\tau^*}[L(\tilde\sigma|_V, \tau^*)] \tag{7}$$

$$R_{class}(h, D) = E_{V,\tau^*}[L(h|_V, \tau^*)] - \min_{\tilde h} E_{V,\tau^*}[L(\tilde h|_V, \tau^*)] . \tag{8}$$

The regret measures how well an algorithm or a classifier performs compared
to the best "static" algorithm, namely, one that ranks $U$ in advance (in $R_{rank}$)
or provides preference information on $U$ in advance (in $R_{class}$). Note that the
minimizer $\tilde h$ in (8) can be easily found by considering each $u, v \in U$ separately.
More precisely, one can take

$$\tilde h(u,v) = \begin{cases} 1 & \text{if } E_{\tau^*}[\tau^*(u,v) \mid u,v \in V] > E_{\tau^*}[\tau^*(v,u) \mid u,v \in V] \\ 0 & \text{if } E_{\tau^*}[\tau^*(u,v) \mid u,v \in V] < E_{\tau^*}[\tau^*(v,u) \mid u,v \in V] \\ \mathbf{1}_{u>v} & \text{otherwise (equality).} \end{cases} \tag{9}$$
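
The pairwise construction of the minimizer can be sketched in Python as follows. This is an illustrative sketch only: the function name and the dictionary `mu`, holding the conditional expectations of $\tau^*(u,v)$ given $u, v \in V$, are hypothetical, and ties are assumed to be broken by the integer order on $U$ (value 1 when $u > v$).

```python
def best_static_preference(mu, universe):
    """Build a binary preference function pair by pair from expectations.

    mu[u, v] is assumed to hold the conditional expectation of tau*(u, v)
    given that both u and v belong to V. Elements are integers, so the
    canonical tie-break (1 when u > v) is well defined.
    """
    h = {}
    for u in universe:
        for v in universe:
            if u == v:
                continue
            if mu[u, v] > mu[v, u]:
                h[u, v] = 1
            elif mu[u, v] < mu[v, u]:
                h[u, v] = 0
            else:
                # equality: fall back on the arbitrary integer order
                h[u, v] = 1 if u > v else 0
    return h


if __name__ == "__main__":
    mu = {(0, 1): 0.7, (1, 0): 0.1,
          (0, 2): 0.2, (2, 0): 0.2,
          (1, 2): 0.0, (2, 1): 0.5}
    print(best_static_preference(mu, [0, 1, 2]))
```

Because each pair is decided independently, the resulting function is pairwise consistent but, like any preference function in this setting, need not be transitive.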

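Now notice that if $D$ satisfies pairwise IIA, then for any set $V_0$ containing
$u, v$,

$$E_{\tau^*}[\tau^*(u,v) \mid V = V_0] = E_{\tau^*}[\tau^*(u,v) \mid u,v \in V] .$$

Therefore, in this case the $\min_{\tilde h}$ and $E_V$ operators commute:

$$\min_{\tilde h} E_{V,\tau^*}[L(\tilde h|_V, \tau^*)] = E_V \min_{\tilde h} E_{\tau^*|V}[L(\tilde h, \tau^*)] .$$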



For our analysis it will indeed be useful to swap the min and $E_V$ operators. We
define

$$R'_{rank}(A, D) = E_{V,\tau^*,s}[L(A_s(V), \tau^*)] - E_V \min_{\tilde\sigma \in S(V)} E_{\tau^*|V}[L(\tilde\sigma, \tau^*)] \tag{10}$$

$$R'_{class}(h, D) = E_{V,\tau^*}[L(h|_V, \tau^*)] - E_V \min_{\tilde h} E_{\tau^*|V}[L(\tilde h, \tau^*)] , \tag{11}$$

where now $\min_{\tilde h}$ is over preference functions $\tilde h$ on $V$. We summarize this section
with the following:

Observation 1.
1. In general (using the concavity of min and Jensen's inequality):
$R'_{rank}(A, D) \ge R_{rank}(A, D)$;
2. Assuming pairwise IIA: $R'_{class}(h, D) = R_{class}(h, D)$.

3 Algorithm for Ranking Using a Preference Function 

This section describes and analyzes an algorithm for obtaining a global ranking
of a subset using a prelearned preference function $h$, which corresponds to the
second stage of the preference-based setting. Our bound on the loss will be
derived using conditional expectation on the preference loss assuming a fixed
subset $V \subseteq U$, and fixed $\sigma^*$ and $\omega$. To further simplify the analysis, we assume
that $h$ is binary, that is $h(u,v) \in \{0, 1\}$ for all $u, v \in U$.

3.1 Description 

One simple idea to obtain a global ranking of the points in V consists of using 
a standard comparison-based sorting algorithm where the comparison operation 
is based on the preference function. However, since in general the preference 
function is not transitive, the property of the resulting permutation obtained is 
unclear. 

This section shows however that the permutation generated by the standard
QuickSort algorithm provides excellent guarantees.^ Thus, the algorithm we
suggest is the following. Pick a pivot element $u$ uniformly at random from
$V$. For each $v \ne u$, place $v$ on the left^ of $u$ if $h(v,u) = 1$, and to its right
otherwise. Proceed recursively with the array to the left of $u$ and the one to its right
and return the concatenation of the permutation returned by the left recursion,
$u$, and the permutation returned by the right recursion.

We will denote by $Q^h_s(V)$ the permutation resulting from running QuickSort
on $V$ using preference function $h$, where $s$ is the random stream of bits used
by QuickSort for the selection of the pivots. As we shall see in the next two
sections, on average, this algorithm produces high-quality global rankings in a
time-efficient manner.
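
The procedure described above can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `quicksort_rank` and the `rng` parameter are hypothetical, the items are assumed to be distinct, and `h` is assumed to be binary with `h(v, u) == 1` meaning that `v` is preferred over `u`.

```python
import random


def quicksort_rank(items, h, rng=random):
    """Order items with a binary, possibly non-transitive preference function.

    h(v, u) == 1 means v is preferred over u. Returns a list with the
    most preferred items first. Each non-pivot element costs one call
    to h per recursion level.
    """
    items = list(items)
    if len(items) <= 1:
        return items
    pivot = rng.choice(items)  # pivot chosen uniformly at random
    left = [v for v in items if v != pivot and h(v, pivot) == 1]
    right = [v for v in items if v != pivot and h(v, pivot) != 1]
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)


if __name__ == "__main__":
    # A cyclic (non-transitive) preference on three points: 0 > 1 > 2 > 0.
    beats = {(0, 1), (1, 2), (2, 0)}
    cyclic = lambda u, v: 1 if (u, v) in beats else 0
    print(quicksort_rank([0, 1, 2], cyclic, random.Random(0)))
```

With a transitive `h` the output is the usual sorted order regardless of the pivots; with a cyclic `h` the output depends on the pivot choices, which is why the guarantees are stated in expectation over $s$.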

^ We are not assuming here transitivity as in most textbook presentations of
QuickSort.

^ We will use the convention that ranked items are written from left to right, starting
with the most preferred ones.



3.2 Ranking Quality Guarantees 

The following theorems give bounds on the ranking quality of the algorithm
described, for both loss and regret, in the general and bipartite cases.

Theorem 2 (Loss bounds in general case). For any fixed subset $V \subseteq U$,
preference function $h$ on $V$, ranking $\sigma^* \in S(V)$ and admissible weight function
$\omega$ the following bound holds:

$$E_s[L_\omega(Q^h_s(V), \sigma^*)] \le 2 L_\omega(h, \sigma^*) . \tag{12}$$

Note: This implies by the principle of conditional expectation that

$$E_{D,s}[L_\omega(Q^h_s(V), \sigma^*)] \le 2 E_D[L_\omega(h, \sigma^*)] \tag{13}$$

(where $h$ can depend on $V$).

Theorem 3 (Loss and regret bounds in bipartite case). For any fixed
$V \subseteq U$, preference function $h$ over $V$ and $\tau^* \in \Pi(V)$, the following bound holds:

$$E_s[L(Q^h_s(V), \tau^*)] = L(h, \tau^*) . \tag{14}$$

If $V, \tau^*$ are drawn from some distribution $D$ satisfying pairwise IIA, then

$$R_{rank}(Q^h_s(\cdot), D) \le R_{class}(h, D) . \tag{15}$$

Note: Equation (14) implies by the principle of conditional expectation that
if $V, \tau^*$ are drawn from a distribution $D$, then

$$E_{D,s}[L(Q^h_s(V), \tau^*)] = E_D[L(h, \tau^*)] \tag{16}$$

(where $h$ can depend on $V$).

To prove these theorems, we must first introduce some tools to help analyze
QuickSort. These tools were first developed in [4] in the context of optimization,
and here we initiate their use in learning.

3.3 Analyzing QuickSort 

Assume $V$ is fixed, and let $Q_s = Q^h_s(V)$ be the (random) ranking output by
QuickSort on $V$ using preference function $h$. During the execution of QuickSort,
the order between two points $u, v$ is determined in one of two ways:

- Directly: $u$ (or $v$) was selected as the pivot with $v$ (resp. $u$) present in the
same sub-array in a recursive call to QuickSort. We denote by $p_{uv} = p_{vu}$
the probability of that event. In that case, the algorithm orders $u$ and $v$
according to the preference function $h$.

- Indirectly: a third element $w \in V$ is selected as pivot with $w, u, v$ all present
in the same sub-array in a recursive call to QuickSort, $u$ is assigned to the
left sub-array and $v$ to the right (or vice versa).

Let $p_{uvw}$ denote the probability of the event that $u$, $v$, and $w$ are present
in the same array in a recursive call to QuickSort and that one of them is
selected as pivot. Note that conditioned on that event, each of these three
elements is equally likely to be selected as a pivot since the pivot selection
is based on a uniform distribution.

If (say) $w$ is selected among the three, then $u$ will be placed on the left of $v$ if
$h(u,w) = h(w,v) = 1$, and to its right if $h(v,w) = h(w,u) = 1$. In all other
cases, the order between $u, v$ will be determined only in a deeper nested call
to QuickSort.

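Let $X, Y : V \times V \to \mathbb{R}$ be any two functions on ordered pairs $u, v \in V$,
and let $Z : \binom{V}{2} \to \mathbb{R}$ be a function on unordered pairs (sets of two elements).
By convention, we use $X(u,v)$ to denote ordered arguments, and $Z_{uv}$ to denote
unordered arguments. We define three functions $\alpha[X, Y] : \binom{V}{2} \to \mathbb{R}$, $\beta[X] :
\binom{V}{3} \to \mathbb{R}$ and $\gamma[Z] : \binom{V}{3} \to \mathbb{R}$ as follows:

$$\alpha[X, Y]_{uv} = X(u,v) Y(v,u) + X(v,u) Y(u,v)$$

$$\begin{aligned}
\beta[X]_{uvw} = {} & \tfrac{1}{3}\big(h(u,v) h(v,w) X(w,u) + h(w,v) h(v,u) X(u,w)\big) \\
& + \tfrac{1}{3}\big(h(v,u) h(u,w) X(w,v) + h(w,u) h(u,v) X(v,w)\big) \\
& + \tfrac{1}{3}\big(h(u,w) h(w,v) X(v,u) + h(v,w) h(w,u) X(u,v)\big)
\end{aligned} \tag{17}$$

$$\begin{aligned}
\gamma[Z]_{uvw} = {} & \tfrac{1}{3}\big(h(u,v) h(v,w) + h(w,v) h(v,u)\big) Z_{uw} \\
& + \tfrac{1}{3}\big(h(v,u) h(u,w) + h(w,u) h(u,v)\big) Z_{vw} \\
& + \tfrac{1}{3}\big(h(u,w) h(w,v) + h(v,w) h(w,u)\big) Z_{uv} .
\end{aligned}$$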

Lemma 1 (QuickSort decomposition).

1. For any $Z : \binom{V}{2} \to \mathbb{R}$,

$$\sum_{u<v} Z_{uv} = \sum_{u<v} p_{uv} Z_{uv} + \sum_{u<v<w} p_{uvw}\, \gamma[Z]_{uvw} .$$

2. For any $X : V \times V \to \mathbb{R}$,

$$E_s\Big[\sum_{u<v} \alpha[Q_s, X]_{uv}\Big] = \sum_{u<v} p_{uv}\, \alpha[h, X]_{uv} + \sum_{u<v<w} p_{uvw}\, \beta[X]_{uvw} .$$

Proof. To see the first part, notice that for every unordered pair $u < v$ the
expression $Z_{uv}$ is accounted for on the RHS of the equation with total coefficient:

$$p_{uv} + \sum_{w \ne u,v} \tfrac{1}{3}\, p_{uvw} \big(h(u,w) h(w,v) + h(v,w) h(w,u)\big) .$$

Now, $p_{uv}$ is the probability that the pair $u, v$ is charged directly (by definition),
and $\tfrac{1}{3} p_{uvw} (h(u,w) h(w,v) + h(v,w) h(w,u))$ is the probability that the pair $u, v$ is
charged indirectly via $w$ as pivot. Since each pair is charged exactly once, these
probabilities are of pairwise disjoint events that cover the probability space.
Hence, the total coefficient of $Z_{uv}$ on the RHS is 1, as it is on the LHS. The second
part is proved similarly.



3.4 Loss Bounds 

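We prove the first part of Theorems 2 and 3. We start with the general case
notation. The loss incurred by QuickSort is (as a function of the random bits
$s$), for fixed $\sigma^*, \omega$, clearly $L_\omega(Q_s, \sigma^*) = \binom{n}{2}^{-1} \sum_{u<v} \alpha[Q_s, \Delta]_{uv}$, where
$\Delta : V \times V \to \mathbb{R}$ is defined as $\Delta(u,v) = \omega(\sigma^*(u), \sigma^*(v))\, \sigma^*(u,v)$. By the
second part of Lemma 1, the expected loss is therefore

$$E_s[L_\omega(Q_s, \sigma^*)] = \binom{n}{2}^{-1} \Big( \sum_{u<v} p_{uv}\, \alpha[h, \Delta]_{uv} + \sum_{u<v<w} p_{uvw}\, \beta[\Delta]_{uvw} \Big) . \tag{18}$$

Similarly, we have that $L_\omega(h, \sigma^*) = \binom{n}{2}^{-1} \sum_{u<v} \alpha[h, \Delta]_{uv}$. Therefore, using
the first part of Lemma 1,

$$L_\omega(h, \sigma^*) = \binom{n}{2}^{-1} \Big( \sum_{u<v} p_{uv}\, \alpha[h, \Delta]_{uv} + \sum_{u<v<w} p_{uvw}\, \gamma[\alpha[h, \Delta]]_{uvw} \Big) . \tag{19}$$

To complete the proof for the general (non-bipartite) case, it suffices to show
that for all $u, v, w$, $\beta[\Delta]_{uvw} \le 2 \gamma[\alpha[h, \Delta]]_{uvw}$. Up to symmetry, there are two
cases to consider. The first case assumes $h$ induces a cycle on $u, v, w$, and the
second assumes it doesn't.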

1. Without loss of generality, assume $h(u,v) = h(v,w) = h(w,u) = 1$. Plugging in
the definitions, we get

$$\beta[\Delta]_{uvw} = \tfrac{1}{3}\big(\Delta(u,v) + \Delta(v,w) + \Delta(w,u)\big) \tag{20}$$

$$\gamma[\alpha[h, \Delta]]_{uvw} = \tfrac{1}{3}\big(\Delta(v,u) + \Delta(w,v) + \Delta(u,w)\big) . \tag{21}$$

By the properties (P1)-(P3) of $\omega$, transitivity of $\sigma^*$ and definition of $\Delta$, we
easily get that $\Delta$ satisfies the triangle inequality:

$$\Delta(u,v) \le \Delta(u,w) + \Delta(w,v)$$
$$\Delta(v,w) \le \Delta(v,u) + \Delta(u,w)$$
$$\Delta(w,u) \le \Delta(w,v) + \Delta(v,u)$$

Summing up the three inequalities, this implies that $\beta[\Delta]_{uvw} \le 2 \gamma[\alpha[h, \Delta]]_{uvw}$.

2. Without loss of generality, assume $h(u,v) = h(v,w) = h(u,w) = 1$. By
plugging in the definitions, this implies that

$$\beta[\Delta]_{uvw} = \gamma[\alpha[h, \Delta]]_{uvw} = \tfrac{1}{3}\, \alpha[h, \Delta]_{uw} ,$$

as required.
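
The two-case argument can also be checked mechanically on a single triple. The Python sketch below enumerates every binary, pairwise-consistent $h$ on three points and every ground-truth ranking, and verifies $\beta[\Delta]_{uvw} \le 2\gamma[\alpha[h,\Delta]]_{uvw}$. It is only a sanity check under stated assumptions: it uses the unweighted case $\omega = 1$ (so $\Delta(a,b) = \sigma^*(a,b)$), and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

THIRD = Fraction(1, 3)


def alpha(X, Y, a, b):
    """alpha[X, Y]_{ab} = X(a,b) Y(b,a) + X(b,a) Y(a,b)."""
    return X[a, b] * Y[b, a] + X[b, a] * Y[a, b]


def beta(h, X, u, v, w):
    """beta[X]_{uvw}: expected charge on a triple, one term per pivot choice."""
    return THIRD * (
        h[u, v] * h[v, w] * X[w, u] + h[w, v] * h[v, u] * X[u, w]
        + h[v, u] * h[u, w] * X[w, v] + h[w, u] * h[u, v] * X[v, w]
        + h[u, w] * h[w, v] * X[v, u] + h[v, w] * h[w, u] * X[u, v])


def gamma(h, Z, u, v, w):
    """gamma[Z]_{uvw}, with Z given as a function on unordered pairs."""
    return THIRD * (
        (h[u, v] * h[v, w] + h[w, v] * h[v, u]) * Z(u, w)
        + (h[v, u] * h[u, w] + h[w, u] * h[u, v]) * Z(v, w)
        + (h[u, w] * h[w, v] + h[v, w] * h[w, u]) * Z(u, v))


def triple_bound_holds():
    """Check beta <= 2 * gamma over all h and all rankings of one triple."""
    u, v, w = 0, 1, 2
    pairs = [(u, v), (v, w), (u, w)]
    for bits in product([0, 1], repeat=3):
        h = {}
        for (a, b), bit in zip(pairs, bits):
            h[a, b], h[b, a] = bit, 1 - bit
        for ranks in permutations([1, 2, 3]):
            sigma = dict(zip((u, v, w), ranks))
            # Unweighted case: Delta(a, b) = 1 exactly when a is ahead of b.
            delta = {(a, b): int(sigma[a] < sigma[b])
                     for a in sigma for b in sigma if a != b}
            z = lambda a, b: alpha(h, delta, a, b)
            if beta(h, delta, u, v, w) > 2 * gamma(h, z, u, v, w):
                return False
    return True


if __name__ == "__main__":
    print(triple_bound_holds())
```

Exact rational arithmetic is used so that the cyclic case, where the inequality can hold with equality, is not affected by rounding.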

This concludes the proof for the general case. As for the bipartite case, (20-21)
translates to

$$\beta[\Delta]_{uvw} = \tfrac{1}{3}\big(\tau^*(u,v) + \tau^*(v,w) + \tau^*(w,u)\big) \tag{22}$$

$$\gamma[\alpha[h, \Delta]]_{uvw} = \tfrac{1}{3}\big(\tau^*(v,u) + \tau^*(w,v) + \tau^*(u,w)\big) . \tag{23}$$

It is trivial to see that the two expressions are identical for any partition $\tau^*$
(indeed, they count the number of times we cross the partition from left to right
when going in a circle on $u, v, w$: it does not matter in which direction we are
going). This concludes the loss bound part of Theorems 2 and 3.

□ 
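
The bipartite equality (14) can be confirmed exhaustively on small instances. The Python sketch below computes the exact output distribution of QuickSort by enumerating pivot choices, then compares the expected number of misordered pairs with the number of pairs on which $h$ disagrees with $\tau^*$. This is an illustrative check, not part of the proof: the function names are hypothetical, $h$ is assumed binary and pairwise consistent, and unnormalized pair counts are used because both sides of (14) share the same normalization.

```python
from fractions import Fraction
from itertools import combinations, product


def quicksort_distribution(items, h):
    """Exact distribution {output ordering: probability} of QuickSort."""
    items = tuple(items)
    if len(items) <= 1:
        return {items: Fraction(1)}
    dist = {}
    for pivot in items:  # each pivot is chosen with probability 1/len(items)
        left = tuple(v for v in items if v != pivot and h[v, pivot] == 1)
        right = tuple(v for v in items if v != pivot and h[v, pivot] == 0)
        for lo, p_lo in quicksort_distribution(left, h).items():
            for hi, p_hi in quicksort_distribution(right, h).items():
                out = lo + (pivot,) + hi
                dist[out] = dist.get(out, Fraction(0)) + p_lo * p_hi / len(items)
    return dist


def loss_of_order(order, tau):
    """Pairs where a negative item (tau = 1) is placed ahead of a positive one."""
    return sum(1 for i, u in enumerate(order) for v in order[i + 1:]
               if tau[v] < tau[u])


def loss_of_h(h, tau, items):
    """Ordered pairs (u, v) where h prefers u but tau prefers v."""
    return sum(1 for u in items for v in items
               if u != v and h[u, v] == 1 and tau[v] < tau[u])


def check_all(n):
    """Check expected QuickSort loss == loss of h for every h and tau on n items."""
    items = list(range(n))
    pairs = list(combinations(items, 2))
    for bits in product([0, 1], repeat=len(pairs)):
        h = {}
        for (u, v), bit in zip(pairs, bits):
            h[u, v], h[v, u] = bit, 1 - bit
        dist = quicksort_distribution(items, h)
        for labels in product([0, 1], repeat=n):
            tau = dict(zip(items, labels))
            expected = sum(p * loss_of_order(o, tau) for o, p in dist.items())
            if expected != loss_of_h(h, tau, items):
                return False
    return True


if __name__ == "__main__":
    print(check_all(3), check_all(4))
```

The enumeration grows quickly with $n$, so it is only meant for very small sets; it complements the triple-wise argument rather than replacing it.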

We place Theorem 2 in the framework used by Cohen et al. [8]. There, the
objective is to find a ranking $\sigma$ that has a low loss measured against $h$ compared
to the theoretical optimal ranking $\sigma_{optimal}$. Therefore, the problem considered
there (modulo learning a preference function $h$) is a combinatorial optimization
and not a learning problem. More precisely, we define

$$\sigma_{optimal} = \arg\min_{\sigma} L(h, \sigma)$$

and want to minimize $L(h, \sigma) / L(h, \sigma_{optimal})$.

Corollary 1. For any $V \subseteq U$ and preference function $h$ over $V$, the following
bound holds:

$$E_s[L(Q^h_s(V), \sigma_{optimal})] \le 2 L(h, \sigma_{optimal}) . \tag{24}$$

The corollary is immediate because technically any ranking, and in particular
$\sigma_{optimal}$, can be taken as $\sigma^*$ in the proof of Theorem 2.

Corollary 2. Let $V \subseteq U$ be an arbitrary subset of $U$ and let $\sigma_{optimal}$ be as above.
Then, the following bound holds for the pairwise disagreement of the ranking
$Q^h_s(V)$ with respect to $h$:

$$E_s[L(h, Q^h_s(V))] \le 3 L(h, \sigma_{optimal}) . \tag{25}$$

Proof. This result follows directly from Corollary 1 and the application of a triangle
inequality. □

The result in Corollary 2 is known from previous work [4,3], where it is proven 
directly without resorting to the intermediate inequality (24). In fact, a better 
bound of 2.5 is known to be achievable using a more complicated algorithm, 
which gives hope for a 1.5 bound improving Theorem 2. 



3.5 Regret Bounds for Bipartite case 

We prove the second part (regret bounds) of Theorem 3. By Observation 1, it
is enough to prove that $R'_{rank}(A, D) \le R'_{class}(h, D)$. Since in the definition of
$R'_{rank}$ and $R'_{class}$ the expectation over $V$ is outside the min operator, we may
continue fixing $V$. Let $D_V$ denote the distribution over $\tau^*$ conditioned on $V$. It
is now clearly enough to prove

$$E_{D_V,s}[L(Q^h_s, \tau^*)] - \min_{\tilde\sigma} E_{D_V}[L(\tilde\sigma, \tau^*)] \le E_{D_V}[L(h, \tau^*)] - \min_{\tilde h} E_{D_V}[L(\tilde h, \tau^*)] . \tag{26}$$

We let $\mu(u,v) = E_{D_V}[\tau^*(u,v)]$. (By pairwise IIA, $\mu(u,v)$ is the same for all
$V$ such that $u, v \in V$.) By linearity of expectation, it suffices to show that

$$E_s[L(Q^h_s, \mu)] - \min_{\tilde\sigma} L(\tilde\sigma, \mu) \le L(h, \mu) - \min_{\tilde h} L(\tilde h, \mu) . \tag{27}$$

Now let $\tilde\sigma$ and $\tilde h$ be the minimizers of the min operators on the left and right 
sides, respectively. Recall that for all $u, v \in V$, $\tilde h(u,v)$ can be taken greedily as 
a function of $\mu(u,v)$ and $\mu(v,u)$ (as in (9)): 

$$\tilde h(u,v) = \begin{cases} 1 & \mu(u,v) > \mu(v,u) \\ 0 & \mu(u,v) < \mu(v,u) \\ 1_{u>v} & \text{otherwise (equality)} . \end{cases} \tag{28}$$
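The greedy rule (28) can be illustrated with a minimal sketch. The function name `greedy_preference` and the dictionary layout for $\mu$ are assumptions made for illustration only; the tie-break uses the natural order on element names as a stand-in for the indicator $1_{u>v}$.

```python
def greedy_preference(mu, u, v):
    """Return the greedy value h~(u, v) in {0, 1} following rule (28).

    mu is assumed to be a dict mapping ordered pairs (a, b) to mu(a, b).
    """
    if mu[(u, v)] > mu[(v, u)]:
        return 1
    if mu[(u, v)] < mu[(v, u)]:
        return 0
    # Tie: fall back to the fixed order on elements, i.e. the indicator 1_{u>v}.
    return 1 if u > v else 0


# Hypothetical values of mu on two pairs.
mu = {("a", "b"): 0.7, ("b", "a"): 0.1,   # a is mostly preferred over b
      ("a", "c"): 0.2, ("c", "a"): 0.2}   # exact tie between a and c

print(greedy_preference(mu, "a", "b"))  # 1
print(greedy_preference(mu, "c", "a"))  # 1 (tie, and "c" > "a")
print(greedy_preference(mu, "a", "c"))  # 0
```

Whatever tie-break is used, exactly one of the two directions of each pair receives the value 1, so the result is a valid binary preference function.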

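Using Lemma 1 and linearity, we write the LHS of (27) as: 

$$\binom{n}{2}^{-1} \Big( \sum_{u<v} p_{uv}\, \alpha[h - \tilde\sigma, \mu]_{uv} + \sum_{u<v<w} p_{uvw} \big(\beta[\mu] - \gamma[\alpha[\tilde\sigma, \mu]]\big)_{uvw} \Big)$$

and the RHS of (27) as: 

$$\binom{n}{2}^{-1} \Big( \sum_{u<v} p_{uv}\, \alpha[h - \tilde h, \mu]_{uv} + \sum_{u<v<w} p_{uvw}\, \gamma[\alpha[h - \tilde h, \mu]]_{uvw} \Big) .$$

Now, clearly for all $u, v$ by construction of $\tilde h$ we must have $\alpha[h - \tilde\sigma, \mu]_{uv} \le 
\alpha[h - \tilde h, \mu]_{uv}$. To conclude the proof of the theorem, we define $F : \binom{V}{3} \to \mathbb{R}$ as 
follows: 

$$F = \beta[\mu] - \gamma[\alpha[\tilde\sigma, \mu]] - \big(\gamma[\alpha[h, \mu]] - \gamma[\alpha[\tilde h, \mu]]\big) . \tag{29}$$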

It now suffices to prove that $F_{uvw} \le 0$ for all $u, v, w \in V$. Clearly $F$ is a 
function of the values of 

$$\mu(a,b) : a, b \in \{u, v, w\}$$
$$h(a,b) : a, b \in \{u, v, w\} \tag{30}$$
$$\tilde\sigma(a,b) : a, b \in \{u, v, w\}$$

(recall that $\tilde h$ depends on $\mu$). The $\mu$-variables can take values satisfying the following 
constraints for all $u, v, w \in V$: 

$$\mu(a,c) \le \mu(a,b) + \mu(b,c) \quad \forall \{a,b,c\} = \{u,v,w\} \tag{31}$$
$$\mu(u,v) + \mu(v,w) + \mu(w,u) = \mu(v,u) + \mu(w,v) + \mu(u,w) \tag{32}$$
$$\mu(a,b) \ge 0 \quad \forall a, b \in \{u,v,w\} . \tag{33}$$

(The second constraint is obvious for any partition $\tau^*$.) 

Let $P \subseteq \mathbb{R}^6$ denote the polytope defined by (31)-(33) in the variables $\mu(a,b)$ 
for $\{a,b\} \subseteq \{u,v,w\}$. We subdivide $P$ into smaller subpolytopes on which the $\tilde h$ 
variables are constant. Up to symmetries, we can consider only two cases: (i) $\tilde h$ 
induces a cycle on $u, v, w$ and (ii) $\tilde h$ is cycle free on $u, v, w$. 

(i) Without loss of generality, assume $\tilde h(u,v) = \tilde h(v,w) = \tilde h(w,u) = 1$. But 
this implies that $\mu(u,v) \ge \mu(v,u)$, $\mu(v,w) \ge \mu(w,v)$ and $\mu(w,u) \ge \mu(u,w)$. 
Together with (32) and (33) this implies that $\mu(u,v) = \mu(v,u)$, $\mu(v,w) = 
\mu(w,v)$ and $\mu(w,u) = \mu(u,w)$. Consequently 

= ^(/"(w, v) + /i(i;, w) + ^(w, u)) 

and $F_{uvw} = 0$, as required. 

(ii) Without loss of generality, assume $\tilde h(u,v) = \tilde h(v,w) = \tilde h(u,w) = 1$. This 
implies that 

$$\mu(u,v) \ge \mu(v,u), \qquad \mu(v,w) \ge \mu(w,v), \qquad \mu(u,w) \ge \mu(w,u) . \tag{34}$$

Let $\tilde P \subseteq P$ denote the polytope defined by (34) and (31)-(33). Clearly $F$ is 
linear in the 6 $\mu$ variables when all the other variables are fixed. Since $F$ is 
also homogeneous in the $\mu$ variables, it is enough to prove that $F \le 0$ for $\mu$ 
taking values in $P' \subseteq \tilde P$, which is defined by adding the constraint, say, 

$$\sum_{a,b} \mu(a,b) = 2 .$$

It is now enough to prove that $F \le 0$ for $\mu$ being a vertex of $P'$. This 
finite set of cases can be easily checked to be: 

$$(\mu(u,v), \mu(v,u), \mu(u,w), \mu(w,u), \mu(w,v), \mu(v,w)) \in A \cup B$$

where 

$$A = \{(0, 0, 1, 0, 0, 1),\ (1, 0, 1, 0, 0, 0)\}$$
$$B = \{(.5, .5, .5, .5, 0, 0),\ (.5, .5, 0, 0, .5, .5),\ (0, 0, .5, .5, .5, .5)\} .$$

The points in $B$ were already checked in case (i) (which is, geometrically, a 
boundary of case (ii)). It remains to check the two points in $A$. 


— case (0, 0, 1, 0, 0, 1): Plugging in the definitions, one checks that: 

$$\beta[\mu]_{uvw} = \tfrac{1}{3}\big(h(w,v)h(v,u) + h(w,u)h(u,v)\big)$$
$$\gamma[\alpha[h,\mu]]_{uvw} = \tfrac{1}{3}\big((h(u,v)h(v,w) + h(w,v)h(v,u))\,h(w,u) + (h(v,u)h(u,w) + h(w,u)h(u,v))\,h(w,v)\big)$$
$$\gamma[\alpha[\tilde h,\mu]]_{uvw} = 0 .$$

Clearly $F$ could be positive only if $\beta[\mu]_{uvw} > 0$, which happens if and only 
if either $h(w,v)h(v,u) = 1$ or $h(w,u)h(u,v) = 1$. In the former case we 
get that either $h(w,v)h(v,u)h(w,u) = 1$ or $h(v,u)h(u,w)h(w,v) = 1$, 
both implying $\gamma[\alpha[h,\mu]]_{uvw} \ge \beta[\mu]_{uvw}$, hence $F \le 0$. In the latter case either 
$h(w,u)h(u,v)h(w,v) = 1$ or $h(u,v)h(v,w)h(w,u) = 1$, both implying 
again $\gamma[\alpha[h,\mu]]_{uvw} \ge \beta[\mu]_{uvw}$ and hence $F \le 0$. 

— case (1, 0, 1, 0, 0, 0):Plugging in the definitions, one checks that: 

$$\beta[\mu]_{uvw} = \tfrac{1}{3}\big(h(w,v)h(v,u) + h(v,w)h(w,u)\big)$$
$$\gamma[\alpha[h,\mu]]_{uvw} = \tfrac{1}{3}\big((h(u,v)h(v,w) + h(w,v)h(v,u))\,h(w,u) + (h(u,w)h(w,v) + h(v,w)h(w,u))\,h(v,u)\big)$$
$$\gamma[\alpha[\tilde h,\mu]]_{uvw} = 0 .$$

Now $F$ could be positive only if either $h(w,v)h(v,u) = 1$ or 
$h(v,w)h(w,u) = 1$. In the former case we get that either $h(w,v)h(v,u)h(w,u) = 1$ 
or $h(v,u)h(u,w)h(w,v) = 1$, both implying $\gamma[\alpha[h,\mu]]_{uvw} \ge \beta[\mu]_{uvw}$, hence 
$F \le 0$. In the latter case either $h(v,w)h(w,u)h(v,u) = 1$ or $h(u,v)h(v,w)h(w,u) = 1$, 
both implying again $\gamma[\alpha[h,\mu]]_{uvw} \ge \beta[\mu]_{uvw}$ and hence $F \le 0$. 

This concludes the proof for the bipartite case. □ 
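The finite case check above lends itself to a brute-force verification. The sketch below is not part of the original proof and rests on assumed forms of the operators from Lemma 1, which is not restated in this section: the pair cost $\alpha[X,\mu]_{ab} = X(a,b)\mu(b,a) + X(b,a)\mu(a,b)$, and $\beta$, $\gamma$ averaging over the three possible pivots with weight 1/3 each. These forms were chosen so that they reproduce the expressions displayed for the two points in $A$; all function names are illustrative. The script evaluates $F$ from (29) for every binary preference function $h$ on $\{u,v,w\}$ at each vertex in $A \cup B$.

```python
from fractions import Fraction
from itertools import permutations, product

ELEMS = ("u", "v", "w")
PAIRS = (("u", "v"), ("v", "w"), ("u", "w"))
# Coordinate order used in the text for the vertices of P'.
COORDS = (("u", "v"), ("v", "u"), ("u", "w"), ("w", "u"), ("w", "v"), ("v", "w"))
HALF = Fraction(1, 2)
A = [(0, 0, 1, 0, 0, 1), (1, 0, 1, 0, 0, 0)]
B = [(HALF, HALF, HALF, HALF, 0, 0), (HALF, HALF, 0, 0, HALF, HALF),
     (0, 0, HALF, HALF, HALF, HALF)]


def tournaments():
    """Yield all 8 binary preference functions h on {u, v, w}."""
    for bits in product((0, 1), repeat=3):
        h = {}
        for (a, b), bit in zip(PAIRS, bits):
            h[(a, b)], h[(b, a)] = bit, 1 - bit
        yield h


def rankings():
    """Yield all 6 rankings as pair indicators: s[(a, b)] = 1 iff a precedes b."""
    for order in permutations(ELEMS):
        pos = {x: i for i, x in enumerate(order)}
        yield {(a, b): int(pos[a] < pos[b]) for a, b in permutations(ELEMS, 2)}


def alpha(x, mu):
    """Assumed pair cost: x pays mu(b, a) for placing a ahead of b."""
    return lambda a, b: x[(a, b)] * mu[(b, a)] + x[(b, a)] * mu[(a, b)]


def pivots():
    """Yield (pivot, a, b): a pivot and the pair it separates."""
    for p in ELEMS:
        a, b = (x for x in ELEMS if x != p)
        yield p, a, b


def beta(h, mu):
    # Cost of the order that pivot p imposes on the pair (a, b), pivot uniform.
    return Fraction(1, 3) * sum(
        h[(a, p)] * h[(p, b)] * mu[(b, a)] + h[(b, p)] * h[(p, a)] * mu[(a, b)]
        for p, a, b in pivots())


def gamma(h, g):
    # Share of the pair cost g(a, b) charged when pivot p separates a and b.
    return Fraction(1, 3) * sum(
        (h[(a, p)] * h[(p, b)] + h[(b, p)] * h[(p, a)]) * g(a, b)
        for p, a, b in pivots())


def greedy(mu):
    """Greedy minimizer of (28); ties are broken by the order on names."""
    return {(a, b): int(mu[(a, b)] > mu[(b, a)]
                        or (mu[(a, b)] == mu[(b, a)] and a > b))
            for a, b in permutations(ELEMS, 2)}


def max_F(vertex):
    """Largest value of F in (29) over all preference functions h."""
    mu = dict(zip(COORDS, (Fraction(x) for x in vertex)))
    total = lambda s: sum(alpha(s, mu)(a, b) for a, b in PAIRS)
    s_opt = min(rankings(), key=total)   # an optimal ranking on the triple
    h_opt = greedy(mu)                   # the greedy preference function
    return max(beta(h, mu) - gamma(h, alpha(s_opt, mu))
               - (gamma(h, alpha(h, mu)) - gamma(h, alpha(h_opt, mu)))
               for h in tournaments())


for vertex in A + B:
    print(vertex, max_F(vertex) <= 0)
```

Exact rational arithmetic (`Fraction`) is used so that the comparison with 0 is not affected by rounding.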



3.6 Time Complexity 

Running QuickSort does not entail $\Omega(|V|^2)$ accesses to $h(u,v)$. The following bound 
on the running time is proven in Section 3.6. 

Theorem 4. The expected number of times QuickSort accesses the preference 
function $h$ is at most $O(n \log n)$. Moreover, if only the top $k$ elements are sought 
then the bound is reduced to $O(k \log k + n)$ by pruning the recursion. 

It is well known that QuickSort on cycle free tournaments runs in time 
$O(n \log n)$, where $n$ is the size of the set we want to sort. That it is true 
for QuickSort on general tournaments is a simple extension (communicated by 
Heikki Mannila) which we present here for self containment. The second part 
requires more work. 



Proof. Let $T(n)$ be the maximum expected running time of QuickSort on a 
possibly cyclic tournament on $n$ vertices in terms of number of comparisons. 
Let $G = (V, A)$ denote a tournament. The main observation is that each vertex 
$v \in V$ is assigned to the left recursion with probability exactly $outdeg(v)/n$ and 
to the right with probability $indeg(v)/n$, over the choice of the pivot. Since the 
out-degrees (and likewise the in-degrees) sum to $\binom{n}{2}$, 
the expected size of both the left and right recursions is exactly $(n - 1)/2$. 
The separation itself costs $n - 1$ comparisons. The resulting recursion formula 
$T(n) \le n - 1 + 2T((n - 1)/2)$ clearly solves to $T(n) = O(n \log n)$. 
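The recursion analyzed above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: `h(u, v) == 1` is read as "u should precede v", the pivot is drawn uniformly at random, and the helper `counting` and the example tournament `cyclic_h` are hypothetical names introduced only to count calls to the preference function.

```python
import random


def quicksort(items, h, rng=random):
    """Rank `items` with a binary preference function h(u, v) in {0, 1}.

    h need not be transitive; h(u, v) == 1 is read as "u should precede v".
    """
    items = list(items)
    if len(items) <= 1:
        return items
    pivot = items.pop(rng.randrange(len(items)))   # uniformly random pivot
    left, right = [], []
    for x in items:                                # n - 1 calls to h
        (left if h(x, pivot) == 1 else right).append(x)
    return quicksort(left, h, rng) + [pivot] + quicksort(right, h, rng)


def counting(h):
    """Wrap h so that the number of calls is available as wrapped.calls."""
    def wrapped(u, v):
        wrapped.calls += 1
        return h(u, v)
    wrapped.calls = 0
    return wrapped


def cyclic_h(u, v):
    """A non-transitive tournament on 0..6: u beats the next three elements mod 7."""
    return 1 if (v - u) % 7 in (1, 2, 3) else 0


h = counting(cyclic_h)
ranking = quicksort(range(7), h, random.Random(0))
print(ranking, h.calls)
```

Each pair is compared at most once (only when one of its two elements is the pivot), so the number of calls never exceeds $\binom{n}{2}$, and its expectation obeys the $O(n \log n)$ bound whether or not $h$ is transitive.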

Assume now that only the $k$ first elements of the output are sought, that is, 
we are interested in outputting only elements in positions $1, \ldots, k$. The algorithm 
which we denote by $k$-QuickSort is clear: recurse with $\min\{k, n_L\}$-QuickSort on 
the left side and $\max\{0, k - n_L - 1\}$-QuickSort on the right side, where $n_L, n_R$ 
are the sizes of the left and right recursions respectively and 0-QuickSort takes 0 
steps by assumption. To make the analysis simpler, we will assume that whenever 
$k \ge n/8$, $k$-QuickSort simply returns the output of the standard QuickSort, 
which runs in expected time $O(n \log n) = O(n + k \log k)$, within the sought 
bound. Fix a tournament $G$ on $n$ vertices, and let $t_k(G)$ denote the running 
time of $k$-QuickSort on $G$, where $k < n/8$. Denote the (random) left and right 
subtournaments by $G_L$ and $G_R$ respectively, and let $n_L = |G_L|$, $n_R = |G_R|$ 
denote their sizes in terms of number of vertices. Then, clearly, 

$$t_k(G) = n - 1 + t_{\min\{k, n_L\}}(G_L) + t_{\max\{0, k - n_L - 1\}}(G_R) . \tag{35}$$

Assume by structural induction that for all $\{k', n' : k' \le n' < n\}$ and for 
all tournaments $G'$ on $n'$ vertices, $E[t_{k'}(G')] \le cn' + c'k' \log k'$ for some global 
$c, c' > 0$. Then, by conditioning on $G_L, G_R$, taking expectations on both sides 
of (35) and by induction, 

$$E[t_k(G) \mid G_L, G_R] \le n - 1 + c n_L + c' \min\{k, n_L\} \log \min\{k, n_L\} + c n_R 1_{n_L < k-1} + c' \max\{k - n_L - 1, 0\} \log \max\{k - n_L - 1, 0\} .$$

By convexity of the function $x \mapsto x \log x$, 

$$\min\{k, n_L\} \log \min\{k, n_L\} + \max\{k - n_L - 1, 0\} \log \max\{k - n_L - 1, 0\} \le k \log k , \tag{36}$$

hence 

$$E[t_k(G) \mid G_L, G_R] \le n - 1 + c n_L + c n_R 1_{n_L < k-1} + c' k \log k . \tag{37}$$

By conditional expectation, 

$$E[t_k(G)] \le n - 1 + c(n - 1)/2 + c' k \log k + c\, E[n_R 1_{n_L < k-1}] .$$

To complete the inductive hypothesis, we need to bound $E[n_R 1_{n_L < k-1}]$, which 
is bounded by $n \Pr[n_L < k - 1]$. The event $\{n_L < k - 1\}$, equivalent to $\{n_R > 
n - k\}$, occurs when a vertex of out-degree at least $n - k \ge 7n/8$ is chosen as 
pivot. For a random pivot $v \in V$, where $V$ is the vertex set of $G$, $E[outdeg(v)^2] \le 
n^2/3 + n/2 \le n^2/2.9$. Indeed, each pair of edges $(v, u_1) \in A$ and $(v, u_2) \in 
A$ for $u_1 \ne u_2$ gives rise to a triangle which is counted exactly twice in the 
cross-terms (hence $n^2/3$, which upper-bounds $2\binom{n}{3}/n$; $n/2$ bounds the diagonal). 
Thus, $\Pr[outdeg(v) \ge 7n/8] = \Pr[outdeg(v)^2 \ge 49n^2/64] \le 0.46$ (by Markov, since $64/(2.9 \cdot 49) < 0.46$). 
Plugging this value into our last estimate yields 

$$E[t_k(G)] \le n - 1 + c(n - 1)/2 + c' k \log k + 0.46 \times cn ,$$

which is at most $cn + c' k \log k$ for $c \ge 30$, as required. □ 
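The pruned recursion of $k$-QuickSort can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function name `k_quicksort` is hypothetical, `h(u, v) == 1` is read as "u should precede v", and the analysis-only simplification of falling back to full QuickSort when $k \ge n/8$ is omitted.

```python
import random


def k_quicksort(items, h, k, rng=random):
    """Return the first min(k, len(items)) positions of a QuickSort ranking."""
    items = list(items)
    if k <= 0 or not items:
        return []                                  # 0-QuickSort does no work
    pivot = items.pop(rng.randrange(len(items)))   # uniformly random pivot
    left, right = [], []
    for x in items:                                # n - 1 calls to h
        (left if h(x, pivot) == 1 else right).append(x)
    # Left side: min{k, n_L}-QuickSort.
    top = k_quicksort(left, h, min(k, len(left)), rng)
    if len(left) < k:
        top.append(pivot)                          # pivot lands at position n_L + 1 <= k
    # Right side: max{0, k - n_L - 1}-QuickSort.
    remaining = max(0, k - len(left) - 1)
    return top + k_quicksort(right, h, remaining, rng)


def precedes(u, v):
    """A transitive preference used only to make the example checkable."""
    return int(u < v)


data = list(range(50))
random.Random(1).shuffle(data)
print(k_quicksort(data, precedes, 5, random.Random(2)))  # [0, 1, 2, 3, 4]
```

With a transitive preference the output is simply the $k$ smallest elements in order, independently of the pivot choices; with a cyclic preference the output depends on the random pivots, as for the full algorithm.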

4 Discussion 

4.1 History of QuickSort 

The now standard textbook algorithm was discovered by Hoare [16] in 1961. 
Montague and Aslam [20] experiment with QuickSort for information retrieval by 
aggregating rankings from different sources of retrieval. They claim an $O(n \log n)$ 
time bound on the number of comparisons although the proof seems to rely on 
the folklore QuickSort proof without addressing the non-transitivity problem. 
They prove certain combinatorial bounds on the output of QuickSort and 
provide empirical justification to its IR merits. Ailon, Charikar and Newman [4] 
also consider the rank aggregation problem and prove theoretical cost bounds 
for many ranking problems on weighted tournaments. They strengthen these 
bounds by considering nondeterministic pivoting rules (arising from solutions to 
certain ranking LP's). This work was extended by Ailon [3] to deal with rankings 
with ties (in particular, top-$k$ rankings). Hedge et al [15] and Williamson et al 
[22] derandomize the random pivot selection step in QuickSort for many of the 
combinatorial optimization problems studied by Ailon et al. 

4.2 The decomposition technique 

The technique developed in Lemma 1 is very general and can be used for a wide 
variety of loss functions and variants of QuickSort involving nondeterministic 
ordering rules (see [4,3]). Such results would typically amount to bounding 
$\beta[X]_{uvw}/\gamma[Z]_{uvw}$ for some carefully chosen functions $X, Z$ (depending on the 
application). 

4.3 Combinatorial Optimization vs. Learning 

In Ailon et al's work [4, 3] the QuickSort algorithm (sometimes referred to there 
as FAS-Pivot) is used to approximate certain NP-Hard (see [5]) weighted 
instances of minimum feedback arc-set in tournaments. There is much similarity 
between the techniques used in the analyses, but there is also a significant 
difference that should be noted. In the minimum feedback arc-set problem we are 
given a tournament $G$ and wish to find an acyclic tournament $H$ on the same 
vertex set minimizing $\Delta(G, H)$, where $\Delta$ counts the number of edges pointing in 
opposite directions between $G$ and $H$ (or a weighted version thereof). However, the 
cost we are considering is $\Delta(G, H_\sigma)$ for some fixed acyclic tournament $H_\sigma$ 
induced by some permutation $\sigma$ (the ground truth). In this work we showed in fact 
that if $G'$ is obtained from $G$ using QuickSort, then $E[\Delta(G', H_\sigma)] \le 2\Delta(G, H_\sigma)$ 
for any $\sigma$ (from Theorem 2). If $H$ is the optimal solution to the (weighted) 
minimum feedback arc-set problem corresponding to $G$, then it is easy to see 
that $\Delta(H, H_\sigma) \le \Delta(G, H) + \Delta(G, H_\sigma) \le 2\Delta(G, H_\sigma)$. However, recovering $H$ 
is NP-Hard in general. Approximating $\Delta(G, H)$ (as done in the combinatorial 
optimization world) by some constant factor¹ $1 + \varepsilon$ by an acyclic tournament $H'$ 
only guarantees (using trivial arguments) a constant factor of $2 + \varepsilon$ as follows: 

$$\Delta(H', H_\sigma) \le \Delta(G, H') + \Delta(G, H_\sigma) \le (1 + \varepsilon)\Delta(G, H) + \Delta(G, H_\sigma) \le (2 + \varepsilon)\Delta(G, H_\sigma) .$$

This work therefore adds an important contribution to [4, 3, 18]. 



4.4 Normalization 

As mentioned earlier, the loss function $L$ used in the bipartite case is not exactly 
the same one used by Balcan et al in [6]. There the total number of "misordered 
pairs" is divided not by $\binom{n}{2}$ but rather by the number of mixed pairs $u, v$ such 
that $\tau^*(u) \ne \tau^*(v)$ (see (6)). We will not discuss the merits of each choice in 
this work, but will show that the loss bound (first part) of Theorem 3 applies 
to their normalization as well. Indeed, let $\nu$ be any normalization 
function on $\Pi(V)$, that is, one that depends on a partition, and define a loss 

$$L(X, \tau^*) = \nu(\tau^*)^{-1} \sum_{u \ne v} X(u,v)\, \tau^*(v,u)$$

for any $X$ defined on $V \times V$ ($X$ can be a preference function $h$ or a ranking). 

In [6], for example, $\nu(\tau^*)$ is taken as $|\{u, v : \tau^*(u) < \tau^*(v)\}|$ and here as $\binom{n}{2}$. 
Since $V, \tau^*$ are fixed in the loss bound of Theorem 3, this makes no difference 
for the proof. For the regret bound (second part) of Theorem 3 this however 
does not work. Indeed, the pairwise IIA is not enough to ensure that the event 
$u, v \in V$ determines $\nu(\tau^*)$, and we cannot simply swap the expectation and the min operator as we 
did in Observation 1. Working around this problem seems to require a stronger 
version of IIA which does not seem natural. 
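The two normalizations can be compared on a toy bipartite instance. The sketch below is only an illustration under assumptions: the partition is encoded as a dictionary `tau` with value 0 for the preferred class, $\tau^*(v,u)$ is read as the indicator of $\tau^*(v) < \tau^*(u)$, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def misordered(x, tau, items):
    """Sum over u != v of X(u, v) * tau*(v, u), with tau*(v, u) = [tau(v) < tau(u)]."""
    return sum(x(u, v) * int(tau[v] < tau[u]) for u, v in permutations(items, 2))


def nu_all_pairs(tau, items):
    """Normalization used here: the number of unordered pairs."""
    return comb(len(items), 2)


def nu_mixed_pairs(tau, items):
    """Normalization of [6]: the number of pairs (u, v) with tau(u) < tau(v)."""
    return sum(1 for u, v in permutations(items, 2) if tau[u] < tau[v])


def loss(x, tau, items, nu):
    """Normalized loss; nu must be nonzero (at least one mixed pair for nu_mixed_pairs)."""
    return Fraction(misordered(x, tau, items), nu(tau, items))


items = [0, 1, 2, 3]
tau = {0: 0, 1: 0, 2: 1, 3: 1}        # elements 0 and 1 form the preferred class
order = [0, 2, 1, 3]                  # a ranking that misorders only the pair (2, 1)
pos = {u: i for i, u in enumerate(order)}


def ranking(u, v):
    return int(pos[u] < pos[v])


print(loss(ranking, tau, items, nu_all_pairs))    # 1/6
print(loss(ranking, tau, items, nu_mixed_pairs))  # 1/4
```

For a fixed $V$ and $\tau^*$ the two losses differ only by a constant factor, which is why the loss bound is unaffected by the choice of normalization.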



5 Lower Bounds 

Balcan et al [6] prove a lower bound of a constant factor of 2 for the regret bound 
of the algorithm MFAT, defined as the solution to the minimum feedback arc-set 
problem on the tournament $V$ with an edge $(u, v)$ if $h(u, v) = 1$. More precisely, 
they show an example of fixed $V$, $h$ and $\tau^* \in \Pi(V)$ such that the classification 
regret of $h$ tends to 1/2 of the ranking regret of MFAT on $V, h$. Note that in this 
case, since $\tau^*$ is fixed, the regret and loss are the same thing for both classification 
and ranking. Here we show the following stronger statement which is simpler to 
prove and applies in particular to the specific algorithm MFAT that is argued 
there. 

¹ Kenyon-Mathieu and Schudy [18] recently found such a PTAS for the combinatorial 
optimization problem. 

Theorem 5. For any deterministic algorithm $A$ taking input $V \subseteq U$ and 
preference function $h$ on $V$ and outputting a ranking $\sigma \in S(V)$, there exists a 
distribution $D$ on $V, \tau^*$ such that 

$$R_{rank}(A, D) \ge 2\, R_{class}(h, D) . \tag{38}$$

Note that this theorem says that in some sense, no deterministic algorithm 
that converts a preference function into a linear ranking can do better than a ran- 
domized algorithm (on expectation) in the bipartite case. Hence, randomization 
is essentially necessary in this scenario. 

The proof is by an adversarial argument. In our construction, $D$ will always 
put all the mass on a single $V, \tau^*$ (deterministic input), so the loss and regret 
are the same thing, and a similar argument will follow for the loss. Also note 
that the normalization $\nu$ will have no effect on the result. 

Proof. Fix $V = \{u, v, w\}$, and let $D$ put all the weight on this particular $V$ and one 
partition $\tau^*$ (which we adversarially choose below). Assume $h(u, v) = h(v, w) = 
h(w, u) = 1$ (a cycle). Up to symmetry, there are two options for the output $\sigma$ 
of $A$ on $V, h$. 

1. $\sigma(u) < \sigma(v) < \sigma(w)$. In this case, the adversary chooses $\tau^*(w) = 0$ and 
$\tau^*(u) = \tau^*(v) = 1$. Clearly $R_{class}(h, D)$ now equals 1/3 ($h$ pays only for misordering 
$v, w$) but $R_{rank}(A, D) = 2/3$ ($\sigma$ pays for misordering the pairs $u, w$ and 
$v, w$). 

2. $\sigma(w) < \sigma(v) < \sigma(u)$. In this case, the adversary chooses $\tau^*(u) = 0$ and 
$\tau^*(v) = \tau^*(w) = 1$. Clearly $R_{class}(h, D)$ now equals 1/3 ($h$ pays only for misordering 
$u, w$) but $R_{rank}(A, D) = 2/3$ ($\sigma$ pays for misordering the pairs $u, v$ and 
$u, w$). 
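The adversarial construction can also be checked exhaustively instead of up to symmetry. The sketch below enumerates all six possible outputs of a deterministic algorithm on the cyclic $h$ and applies the adversary's rule in the form "place the last-ranked element alone in the preferred class". The encoding (value 0 for the preferred class, loss normalized by the 3 pairs) and all names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations

V = ("u", "v", "w")
# The cyclic preference function h(u,v) = h(v,w) = h(w,u) = 1 from the proof.
CYCLE = {("u", "v"), ("v", "w"), ("w", "u")}


def h(a, b):
    return 1 if (a, b) in CYCLE else 0


def loss(x, tau):
    """Fraction of the 3 pairs that x misorders with respect to the partition tau."""
    wrong = sum(x(a, b) * int(tau[b] < tau[a]) for a, b in permutations(V, 2))
    return Fraction(wrong, 3)


def adversary(order):
    """Put the last-ranked element alone in the preferred class (tau = 0)."""
    return {x: (0 if x == order[-1] else 1) for x in V}


def worst_case(order):
    """Return (classification loss of h, ranking loss of `order`) under the adversary."""
    pos = {x: i for i, x in enumerate(order)}

    def sigma(a, b):
        return int(pos[a] < pos[b])

    tau = adversary(order)
    return loss(h, tau), loss(sigma, tau)


results = {order: worst_case(order) for order in permutations(V)}
for order, (r_class, r_rank) in results.items():
    print(order, r_class, r_rank)
```

For every one of the six rankings the pair of losses is (1/3, 2/3), matching the factor of 2 in (38).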

This concludes the proof. □ 

6 Conclusion 

We described a reduction of the learning problem of ranking to classification. 
The efficiency of this reduction makes it practical for large-scale information 
extraction and search engine applications. A finer analysis of QuickSort is likely 
to further improve our reduction bound by providing a concentration inequality 
for the algorithm's deviation from its expected behavior using the confidence 
scores output by the classifier. Our reduction leads to a competitive ranking 
algorithm that can be viewed as an alternative to the algorithms previously 
designed for the score-based setting. 